Identifying which parts of a Web page contain target content (e.g., the portion of an 
online news page that contains the actual article) is a significant problem that must 
be addressed for many Web-based applications. Most approaches to this problem ... 

Keywords: conditional random fields, content identification, maximum entropy 
Markov models, sequence labeling 

Faced with growing knowledge management needs, enterprises are increasingly 
realizing the importance of interlinking critical business information distributed across 
structured and unstructured data sources. We present a novel system, called EROCS, 
for ... 

December 2007, ACM Transactions on Asian Language Information Processing (TALIP), Volume 6 Issue 4 

Stemming words to (usually) remove suffixes has applications in text search, 
machine translation, document summarization, and text classification. For example, 
English stemming reduces the words "computer," "computing," "computation," and 
"computability" ... 

Keywords: Indonesian, information retrieval, stemming 

May ACM Transactions on Internet Technology (TOIT), Volume 7 Issue 2 

Bibliometrics: Downloads (6 Weeks): 27, Downloads (12 Months): 273, Citation Count: 1 

During the last few years, several studies on the characterization of the public Web 
space of various national domains have been published. The pages of a country are 
an interesting set for studying the characteristics of the Web because at the same ... 

Keywords: Web characterization, Web measurement 

The primary business model behind Web search is based on textual advertising, 
where contextually relevant ads are displayed alongside search results. We address 
the problem of selecting these ads so that they are both relevant to the queries and 
profitable ... 

In this paper, we examine an important recent rule-based information extraction (IE) 
technique named Boosted Wrapper Induction (BWI) by conducting experiments on a 
wider variety of tasks than previously studied, including tasks using several 
collections ... 

To ensure high data quality, data warehouses must validate and cleanse incoming 
data tuples from external sources. In many situations, clean tuples must match 
acceptable tuples in reference tables. For example, product name and description 
fields ... 

This paper describes the motivation, design and performance of Porcupine, a scalable 
mail server. The goal of Porcupine is to provide a highly available and scalable 
electronic mail service using a large cluster of commodity PCs. We designed 
Porcupine ... 

balancing, replication 

Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized 
newswire stories recently made available by Reuters, Ltd. for research purposes. Use 
of this data for research on text categorization requires a detailed understanding ... 

We introduce a new text-indexing data structure, the String B-Tree, that can be seen 
as a link between some traditional external-memory and string-matching data 
structures. In a short phrase, it is a combination of B-trees and Patricia ... 

Keywords: B-tree, Patricia trie, external-memory data structure, prefix and range 
search, string searching and sorting, suffix array, suffix tree, text index 

Precis queries represent a novel way of accessing data, which combines ideas and 
techniques from the fields of databases and information retrieval. They are 
free-form, keyword-based queries on top of relational databases that generate entire 
multi-relation ... 

Large repositories of 3D data are rapidly becoming available in several fields, 
including mechanical CAD, molecular biology, and computer graphics. As the number 
of 3D models grows, there is an increasing need for computer algorithms to help 
people find ... 

Parallel corpora have become an essential resource for work in multilingual natural 
language processing. In this article, we report on our work using the STRAND system 
for mining parallel text on the World Wide Web, first reviewing the original 
algorithm ... 

This updated course on simulating natural phenomena will cover the latest research 
and production techniques for simulating most of the elements of nature. The 
presenters will provide movie production, interactive simulation, and research 
perspectives ... 